NSF PAR Search | NSF Public Access Repository

Beehive: A Flexible Network Stack for Direct-Attached Accelerators

https://doi.org/10.1109/MICRO61859.2024.00037

Lim, Katie; Giordano, Matthew; Stavrinos, Theano; Zhang, Irene; Nelson, Jacob; Kasikci, Baris; Anderson, Thomas (November 2024, IEEE)

Full Text Available
TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches

Shah, Aashaka; Chidambaram, Vijay; Cowan, Meghan; Maleki, Saeed; Musuvathi, Madan; Mytkowicz, Todd; Nelson, Jacob; Saarikivi, Olli; Singh, Rachee (April 2023, USENIX)

Machine learning models are increasingly being trained across multiple GPUs and servers. In this setting, data is transferred between GPUs using communication collectives such as ALLTOALL and ALLREDUCE, which can become a significant bottleneck in training large models. Thus, it is important to use efficient algorithms for collective communication. We develop TACCL, a tool that enables algorithm designers to guide a synthesizer into automatically generating algorithms for a given hardware configuration and communication collective. TACCL uses a novel communication sketch abstraction to get crucial information from the designer to significantly reduce the search space and guide the synthesizer towards better algorithms. TACCL also uses a novel encoding of the problem that allows it to scale beyond single-node topologies. We use TACCL to synthesize algorithms for three collectives and two hardware topologies: DGX-2 and NDv2. We demonstrate that the algorithms synthesized by TACCL outperform the Nvidia Collective Communication Library (NCCL) by up to 6.7x. We also show that TACCL can speed up end-to-end training of Transformer-XL and BERT models by 11%–2.3x for different batch sizes.
Full Text Available
A Cloud-Based Data Storage and Visualization Tool for Smart City IoT: Flood Warning as an Example Application

https://doi.org/10.3390/smartcities6030068

Leal Sobral, Victor Ariel; Nelson, Jacob; Asmare, Loza; Mahmood, Abdullah; Mitchell, Glen; Tenkorang, Kwadwo; Todd, Conor; Campbell, Bradford; Goodall, Jonathan L. (June 2023, Smart Cities)

Collecting, storing, and providing access to Internet of Things (IoT) data are fundamental tasks to many smart city projects. However, developing and integrating IoT systems is still a significant barrier to entry. In this work, we share insights on the development of cloud data storage and visualization tools for IoT smart city applications using flood warning as an example application. The developed system incorporates scalable, autonomous, and inexpensive features that allow users to monitor real-time environmental conditions, and to create threshold-based alert notifications. Built in Amazon Web Services (AWS), the system leverages serverless technology for sensor data backup, a relational database for data management, and a graphical user interface (GUI) for data visualizations and alerts. A RESTful API allows for easy integration with web-based development environments, such as Jupyter notebooks, for advanced data analysis. The system can ingest data from LoRaWAN sensors deployed using The Things Network (TTN). A cost analysis can support users’ planning and decision-making when deploying the system for different use cases. A proof-of-concept demonstration of the system was built with river and weather sensors deployed in a flood-prone suburban watershed in the city of Charlottesville, Virginia.
Full Text Available
Xenic: SmartNIC-Accelerated Distributed Transactions

https://doi.org/10.1145/3477132.3483555

Schuh, Henry N.; Liang, Weihao; Liu, Ming; Nelson, Jacob; Krishnamurthy, Arvind (October 2021, The ACM SIGOPS 28th Symposium on Operating Systems Principles)

Full Text Available
RedPlane: enabling fault-tolerant stateful in-switch applications

https://doi.org/10.1145/3452296.3472905

Kim, Daehyeok; Nelson, Jacob; Ports, Dan R.; Sekar, Vyas; Seshan, Srinivasan (August 2021, Proceedings of the 2021 ACM SIGCOMM 2021 Conference)
Many recent efforts have demonstrated the performance benefits of running datacenter functions (e.g., NATs, load balancers, monitoring) on programmable switches. However, a key missing piece remains: fault tolerance. This is especially critical as the network is no longer stateless and pure endpoint recovery does not suffice. In this paper, we design and implement RedPlane, a fault-tolerant state store for stateful in-switch applications. This provides in-switch applications consistent access to their state, even if the switch they run on fails or traffic is rerouted to an alternative switch. We address key challenges in devising a practical, provably correct replication protocol and implementing it in the switch data plane. Our evaluations show that RedPlane incurs negligible overhead and enables end-to-end applications to rapidly recover from switch failures.
Full Text Available
Bundled references: an abstraction for highly-concurrent linearizable range queries

https://doi.org/10.1145/3437801.3441614

Nelson, Jacob; Hassan, Ahmed; Palmieri, Roberto (February 2021, Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)
Full Text Available
On the Performance Impact of NUMA on One-sided RDMA Interactions

https://doi.org/10.1109/icdcs47774.2020.00194

Nelson, Jacob; Palmieri, Roberto (December 2020, 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS))
Full Text Available
KVCG: a heterogeneous key-value store for skewed workloads

https://doi.org/10.1145/3456727.3463779

Miller, dePaul; Nelson, Jacob; Hassan, Ahmed; Palmieri, Roberto (January 2021, Proceedings of the 14th ACM International Conference on Systems and Storage)
Full Text Available
Pegasus: Tolerating Skewed Workloads in Distributed Storage with In-Network Coherence Directories

Li, Jialin; Nelson, Jacob; Michael, Ellis; Jin, Xin; Ports, Dan R. (November 2020, 14th USENIX Symposium on Operating Systems Design and Implementation)
High performance distributed storage systems face the challenge of load imbalance caused by skewed and dynamic workloads. This paper introduces Pegasus, a new storage system that leverages new-generation programmable switch ASICs to balance load across storage servers. Pegasus uses selective replication of the most popular objects in the data store to distribute load. Using a novel in-network coherence directory, the Pegasus switch tracks and manages the location of replicated objects. This allows it to achieve load-aware forwarding and dynamic rebalancing for replicated keys, while still guaranteeing data coherence and consistency. The Pegasus design is practical to implement as it stores only forwarding metadata in the switch data plane. The resulting system improves the throughput of a distributed in-memory key-value store by more than 10x under a latency SLO -- results which hold across a large set of workloads with varying degrees of skew, read/write ratio, object sizes, and dynamism.
Full Text Available
Performance Evaluation of the Impact of NUMA on One-sided RDMA Interactions

https://doi.org/10.1109/SRDS51746.2020.00036

Nelson, Jacob; Palmieri, Roberto (January 2020, International Symposium on Reliable Distributed Systems, SRDS 2020, Shanghai, China, September 21-24, 2020)

Full Text Available
